A Web Site Classification Approach Based On Its Topological Structure

Authors

  • Ji-bin Zhang
  • Zhi-ming Xu
  • Kun-li Xiu
  • Qi-shu Pan
Abstract

Automatic web site classification has broad application prospects, yet little research has addressed it. Unlike pure text, a web site is a collection of many web pages connected by hyperlinks, so text classification methods cannot be applied to it directly. This paper proposes a web site classification approach based on the site's topological structure. Given a web site, we first represent its topological structure as a directed graph and extract from it a strongly connected sub-graph that includes the site's home page. Second, we apply an improved PageRank algorithm to the sub-graph to select topic-relevant resources, and represent them as a topic vector of the site. Finally, we use an SVM classifier to classify the site in terms of its topic vector. In web site classification experiments, our approach achieved better performance than the traditional super page-based web site classification approach.
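The first two steps of the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The abstract does not describe the "improved" PageRank, so plain PageRank stands in for it. The function names, the toy link graph, the page texts, the `top_k` cutoff, and the rank-weighted term counts used as a topic vector are all hypothetical. The final SVM step is left out; the resulting vector would be passed to any SVM implementation.

```python
from collections import Counter


def reachable(graph, start):
    """Return every node reachable from `start` by following directed edges."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def home_component(graph, home):
    """Strongly connected component containing `home`:
    pages reachable from the home page that can also link back to it."""
    reverse = {}
    for src, targets in graph.items():
        for dst in targets:
            reverse.setdefault(dst, []).append(src)
    return reachable(graph, home) & reachable(reverse, home)


def pagerank(graph, nodes, damping=0.85, iterations=50):
    """Plain PageRank by power iteration, restricted to `nodes`.
    Assumption: stands in for the paper's improved PageRank."""
    nodes = sorted(nodes)
    n = len(nodes)
    rank = {p: 1.0 / n for p in nodes}
    # Keep only links that stay inside the selected sub-graph.
    out = {p: [q for q in graph.get(p, ()) if q in rank] for p in nodes}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in nodes}
        for p in nodes:
            targets = out[p] or nodes  # a page without out-links spreads evenly
            share = damping * rank[p] / len(targets)
            for q in targets:
                new[q] += share
        rank = new
    return rank


def topic_vector(graph, texts, home, top_k=2):
    """Select the top_k ranked pages and build a rank-weighted term vector.
    Assumption: the weighting scheme is illustrative only."""
    component = home_component(graph, home)
    rank = pagerank(graph, component)
    selected = sorted(component, key=lambda p: (-rank[p], p))[:top_k]
    vector = Counter()
    for page in selected:
        for term in texts.get(page, "").lower().split():
            vector[term] += rank[page]
    return selected, dict(vector)


# Hypothetical toy site: "/legal" never links back, so it falls outside
# the strongly connected sub-graph around the home page "/".
site_links = {
    "/": ["/products", "/about", "/legal"],
    "/products": ["/", "/about"],
    "/about": ["/"],
    "/legal": [],
}
site_texts = {
    "/": "graphene sensors shop",
    "/products": "graphene sensors catalogue",
    "/about": "graphene research company",
    "/legal": "terms and conditions",
}

selected_pages, site_vector = topic_vector(site_links, site_texts, "/")
print(selected_pages)
print(site_vector)
```

Restricting PageRank to the home page's strongly connected component means that pages with no path back to the home page, such as a terms-and-conditions page in the toy site, cannot contribute terms to the topic vector.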

Similar resources

Study of Stone-wales Defect on Elastic Properties of Single-layer Graphene Sheets by an Atomistic based Finite Element Model

In this paper, an atomistic based finite element model is developed to investigate the influence of topological defects on mechanical properties of graphene. The general in-plane stiffness matrix of the hexagonal network structure of graphene is found. Effective elastic modulus of a carbon ring is determined from the equivalence of molecular potential energy related to stretch and angular defor...

A Novel Approach to Feature Selection Using PageRank algorithm for Web Page Classification

In this paper, a novel filter-based approach is proposed using the PageRank algorithm to select the optimal subset of features as well as to compute their weights for web page classification. To evaluate the proposed approach multiple experiments are performed using accuracy score as the main criterion on four different datasets, namely WebKB, Reuters-R8, Reuters-R52, and 20NewsGroups. By analy...

Topological structure on generalized approximation space related to n-arry relation

Classical structure of rough set theory was first formulated by Z. Pawlak in [6]. The foundation of its object classification is an equivalence binary relation and equivalence classes. The upper and lower approximation operations are two core notions in rough set theory. They can also be seen as a closure operator and an interior operator of the topology induced by an equivalence relation on a u...

Ecological classification of southern intertidal zones of Qeshm Island, based on CMECS model

The “Coastal and Marine Ecological Classification Standard (CMECS)”, a new approach to ecological classification, was applied to 122 km of the southern intertidal zone of Qeshm Island, located in the Hormouz Strait - the Persian Gulf. Two components of this model, Surface Geology (SGC) and Biotic Cover (BCC), were used. Considering the extent and geomorphological alternations of the covered area, 12...

Two-Phase Web Site Classification Based on Hidden Markov Tree Models

With the exponential growth of both the amount and diversity of the information that the web encompasses, automatic classification of topic-specific web sites is highly desirable. In this paper we propose a novel approach for web site classification based on the content, structure and context information of web sites. In our approach, the site structure is represented as a two-layered tree in wh...

Journal title:
  • Int. J. of Asian Lang. Proc.

Volume 20  Issue 

Pages  -

Publication date 2010